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Abstract 

Background: Dissecting the genomic spectrum of clinical disease entities is a challenging task. Recursive 
partitioning (or classification trees) methods provide powerful tools for exploring complex interplay among genomic 
factors, with respect to a main factor, that can reveal hidden genomic patterns. To take confounding variables into 
account, the partially linear tree-based regression (PLTR) model has been recently published. It combines regression 
models and tree-based methodology. It is however computationally burdensome and not well suited for situations 
for which a large number of exploratory variables is expected. 

Methods: We developed a novel procedure that represents an alternative to the original PLTR procedure, and 
considered different selection criteria. A simulation study with different scenarios has been performed to compare the 
performances of the proposed procedure to the original PLTR strategy. 

Results: The proposed procedure with a Bayesian Information Criterion (BIC) achieved good performances to detect 
the hidden structure as compared to the original procedure. The novel procedure was used for analyzing patterns of 
copy-number alterations in lung adenocarcinomas, with respect to Kirsten Rat Sarcoma Viral Oncogene Homolog 
gene (KRAS) mutation status, while controlling for a cohort effect. Results highlight two subgroups of pure or nearly 
pure wild-type KRAS tumors with particular copy-number alteration patterns. 

Conclusions: The proposed procedure with a BIC criterion represents a powerful and practical alternative to the 
original procedure. Our procedure performs well in a general framework and is simple to implement. 
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Background 

Recent advances in large-scale genomic technologies pro- 
vide an unprecedented amount of data that offer new 
insights into the molecular portraits of diseases. This 
information enables to dissect a heterogeneous disease 
entity into more homogeneous subentities that can be 
relevant for clinical purposes. 

This problem is particularly appealing in oncology 
where molecular classification of tumors, that are based 
on the status of specific targeted therapy, rely mainly 
upon a single molecular event but overlook tumoral 
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genomic complexity. For most solid epithelial tumors, 
these genomic events are primarily DNA mutations that 
give selective growth advantages to tumor cells. 

A classical example in non-small-cell lung cancer 
(NSCLC) is the activating EGFR (epidermal growth 
factor receptor) mutation that predicts the sensitivity 
to EGFR tyrosine kinase inhibitors. EGFR-mutant lung 
adenocarcinoma is nowadays almost considered as a dis- 
tinct disease entity. Such is not the case for KRAS (Kirsten 
Rat Sarcoma Viral Oncogene Homolog gene) mutation 
that represents one of the most common mutations in 
NSCLC. With the exception of its well-known mutually 
exclusive relationship with EGFR mutation, the clinical 
utility of KRAS mutation status has not been clearly 
demonstrated [1]. Moreover, it is still unclear whether 
subgroups exist within KRAS wild-type or KRAS mutated 
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tumors. The identification of more homogeneous molecu- 
lar subgroups with respect to KR AS mutational status may 
provide new genomic taxonomy of NSCLC tumors, that 
may help for the advancement of personalized medicine. 

The aim of the clinical study that prompted this 
methodological work was to decipher heterogeneity of 
lung adenocarcinomas with respect to KRAS mutation 
status based upon whole-genome copy-number alter- 
ations. Copy-number alteration (CNA) is one of the main 
type of genomic alterations that is linked to genome insta- 
bility and represents a key feature of human carcinomas 
[2]. In previous cancer studies, association between spe- 
cific CNAs and point mutations have been reported such 
as, for example, the relationship between EGFR mutations 
and copy-gains of 7pl2 (which harbors EGFR gene) in 
lung adenocarcinomas. However, few investigations have 
been performed for studying the relationships between 
KRAS mutation and CNAs. 

Identifying homogeneous subgroups, with respect to 
a main factor, based on the complex interplay among 
genomic alterations is a difficult task that cannot be easily 
done with standard regression models. 

In contrast, recursive partitioning (or tree-based) 
methods such as CART (Classification And Regression 
Tree) [3] is a flexible and powerful alternative for explor- 
ing high-order interaction between explanatory variables. 
From a data mining perspective, the purpose of such 
approach is to decompose a data space recursively into 
smaller areas that are defined by the set of explanatory 
variables and tree-structured. The hypothesis space is the 
set of all possible hyper-rectangular areas. These areas 
are more homogeneous with respect to the main fac- 
tor as compared to the whole population. The analysis 
of the patterns of these areas, that are defined by the 
explanatory variables, can provide meaningful biologi- 
cal insights. In the context of non-parametric statistical 
methods, random forests [4] is the classical extension 
to tree-based methods with many available R packages 
(for a few: VarSelRF [5], SRF [6], RF [7]). As compared 
to tree-based methods, a forest that consists of thou- 
sands of unpruned trees is more stable and well suited for 
prediction. However, random forests loose the easy inter- 
pretability of CART, which represents the key objective 
when dissecting the clinico-biological spectrum of clinical 
disease entities. 

For clinical epidemiology studies, an important draw- 
back of classical tree-based methodology is that it does 
not provide a straightforward way of adjusting for con- 
founding variables. In practice, confounding and explana- 
tory variables are considered in the same way. Thus, the 
final tree is a mixture of confounder and explanatory 
variables lacking of clear interpretation and whose joint 
effects are distorted. This problem was of particular con- 
cern in our clinical studies since our series was composed 



of two different sub-populations (Asian and Caucasian 
patients). In lung adenocarcinomas, KRAS mutation is 
found in about one third of the tumours in Caucasian 
populations, as opposed to less than one tenth is Asian 
populations. Thus, in our study, it was mandatory to 
adjust for this confounding factor. 

In a pioneering a work, Chen et al. [8] have intro- 
duced a new class of regression models, called partially 
linear tree-based regression models (PLTR). This new 
framework has been proposed for genetic epidemiology 
studies in order to assess complex joint gene-gene and 
gene-environment effects taking into account confound- 
ing variables. In practice, PLTR models represent a new 
class of semi-parametric regression models that integrates 
the advantages of generalized linear regression and tree- 
structure models. The linear part is used to model the 
main effects of confounder variables and the nonparamet- 
ric tree part is used to capture the distributional shape 
of the data relying on the complex joint effects of mul- 
tiple explanatory variables. In their article, Chen et al. 
have proposed a four-step selection and testing proce- 
dure for identifying the optimal tree while adjusting for 
linear (on the generalized linear scale) confounding vari- 
ables. This methodology has been recently extended for 
considering multivariate outcomes [9]. However, Chens 
et al. procedure heavily relies on resampling, which is 
computationally burdensome and not well-suited to situ- 
ations for which a large number of explanatory variables 
is expected. In the present work, we propose and evaluate 
an alternative procedure with different selection criteria, 
which considers separately the objectives of selection and 
testing. 

We first describe the novel procedure with three differ- 
ent selection criteria. It corresponds to a modified PLTR 
procedure with four steps, of which the two first are com- 
mon to the one proposed by Chen et al. A simulation 
study with different scenarios is presented that compares 
the power of the proposed procedure to the original PLTR 
strategy. The proposed procedure is used to decipher 
heterogeneity of lung adenocarcinomas, with respect to 
KRAS mutation, based on copy-number alterations. 

Methods 

In the following we present our novel procedure with 
different selection criteria. The first two steps are simi- 
lar to those of the original PLTR procedure (Chen et al.) 
whereas the last two steps are new. The four steps 
are summarized in Figure 1 and presented in details 
below. 

Denote Y the outcome of interest (or the main fac- 
tor for the application considered in this work), X the 
confounding variables (to be modeled linearly), and Z 
the explanatory variables. The PLTR model is specified 
by: 
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Step 1. Construction of a maximal tree 



+ 



a 
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Step 4. Testing 
Computed p-value for the selected tree. 



Figure 1 Flow-chart of the four steps of the method. 



g (E (Y|X, Z)) = X'6 + p T F (T (Z)) , 



(1) 



where g(-) is a known link function (generalized linear 
model), F (T (Z)) is a vector of indicator variables repre- 
senting the leaves of the tree T (Z). 

Step 1 : Construction of a maximal tree 

The maximal tree is constructed as follows: 

• Initialization: fit the generalized linear model 
(GLM) g (E (Y|X, Z)) = X'0 (o) + £ (0) 



• Iterations: iterate the following steps starting with 
k = 1. 

- fit the tree part: construct a maximal tree 
model based on Z, using as 
offset 

- fit the leaves of the tree: fit the GLM 

g (E (Y|X, Z)) = PjPf (r<*> (Z)) using 
X'tf**" 1 ) as offset 

- fit the parametric part: fit the GLM 

g (E (Y|X, Z)) = X'0^, with ^ } F (F (/:) (Z)) 
using as offset 
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• Ending conditions: the algorithm stops when the 
estimates of 0 stabilize within a pre-specified range or 
after a pre-specified number of iterations. 

In the above procedure, an offset is a predictor variable 
included in the model with coefficient fixed equal to one. 

In the construction of the tree, the goodness of a can- 
didate split is assessed for each node by the deviance of a 
generalized linear model fitted in the node by considering 
X!0 as the offset. More precisely, the goodness of a candi- 
date split is the deviance of the parent node minus the sum 
of the deviance of the two child nodes. The recursive par- 
titioning stops when the number of cases in each terminal 
node is smaller than a pre-assigned threshold. 

Step 2: Construction of the sequence of nested subtrees 

At the end of the previous step, the estimated tree Tr 
is a maximal tree which generally overfits the data. The 
second step constructs a sequence of nested subtrees of 
Tr. 

The PLTR model (1) obtained from the previous step is 

g (E (Y|X, Z)) = X'6 Tr + p TR F (T r (Z)) , (2) 

where Tr (Z) represents the maximal tree at convergence, 
R being its size (number of terminal nodes or leaves). 

Let£>(X; %) be the deviance computed under the null 
hypothesis 

Ho: g(E(Y\X,Z))=X'0 + p o (3) 

and D (X, T(Z) ; <9 r ) the deviance computed under the 
alternative hypothesis 

Hng(E (Y|X, Z)) = X'O + PtF (T (Z)) , (4) 

with a tree T(Z) c T R (Z). 

Let r < R be the pre-specified maximal size of subtrees 
to be considered. A sequence of nested candidates sub- 
trees T2(Z) C • • • C T r (Z) of Tr(Z) is constructed as 
follows: 

• The procedure is forward with T\ (Z) representing 
the root of the tree Tr(Z). Let 

| IJ^Z), w = 1, ...,«/ J be the set of subtrees of 
Tr(Z) with ; leaves, such that for all m = 1, . . . , rij, 
T h i is a subtree of TJ 1 : 2}_i (Z) c T™ (Z). 

• Tj (Z) is the subtree of Tr(Z) with ; leaves such that 
Tj-i (Z) c Tj (Z), chosen as 7) = Tf with 

m* = arg max \d (x ; 0 O ) —D (x, T™(Z) ; § T m)^ . 
Step 3: tree selection 

We select one of the trees of the sequence T\ C T2 C 
• • • C T r . For this selection step, we use either 



• penalized maximum likelihood methods: the Akai'ke 
information criterion(AIC) [10] and the Bayesian 
information criterion (BIC) [11], 

• or a cross-validation method. 

The competing models to be considered are: 

Mj :g(E (Y|X, Z)) = X!6 Tj + p Tj F (Tj (Z)) , ; = 1, . . . , r 

(5) 

with F (T\ (Z)) = 1 representing the situation where the 
tree is reduced to the root node, that is the null model (3). 

BIC andAIC criteria 

The BIC criterion for the model Aij is 

BIC (Mj) = 2C (Mj\0 Tj , Pt) ~ 8j log (N) , 

N being the sample size, <Sy the number of free param- 
eters involved in the model Aij (8j = dim (0) + j) and 
the log-likelihood for the model Mj. 

The model selected by the BIC criterion is M hlc = 
MjHc, where y blc is defined by 

; bic = arg max BIC (Mj) . 

j—l,...,r \ / 

We denote T hic = TjUc the tree used in the model M hic . 
The AIC criterion for the model Mj is 

AIC (Mj) = 2C (Mj\0 Tr fi Tj ) - 28j, 

with 8j = dim (0) + The model selected by the AIC 
criterion is M* 1C = Mj^c where j aic is defined by 

; aic = arg max AIC (Mj) . 

j—l,...,r \ / 

We denote T aic = T^-ak the tree used in the model A^ aic . 
Cross-validation criterion 

As an alternative to the penalized maximum likelihood 
criteria presented above, we propose a cross-validation 
procedure on the global PLTR model for selecting the 
optimal tree. The competing models Mj are those defined 
in (5). 

The original sample A is randomly partitioned into K 
equal size subsamples: 

K 

A = [J with At n A m = 0 for all i ^ m 

1=1 

For i = 1, . . . ,K, denotes by A-t = Um#£ A m the I th 
training set, while At is the corresponding validation set. 

For each i = 1, ...,/<T, the following steps are per- 
formed: 
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• fit the PLTR model (1) with the sample A-t. At the 
end of step 1, the fitted PLTR model is 

g (E (Y|X, Z)) = X'6 T e + P T tF (r| (Z)) , 

where T% (Z) represents the maximal tree at 
convergence. 

• Construct a sequence of r — 1 nested subtrees 

r| (Z) , . . . , (Z) as in step 2, and determine the 
underlying PLTR models sequence: 

Mj:g(E (Y|X, Z))=X f 6 T f + ^f(t/(Z)), ; = 1, . . . , r 

• For each ; = 1, . . . , r, use the validation sample At to 
compute the cross-validation error CV/ of the model 

The mean cross-validation error is 

The selected model is M CY = jM/cv where ; cv is defined 
by 

y cv = arg min CV;. 
;=i,...,r 

We denote r cv = 7)cv the tree used in the model M. cy . 
Step 4: Testing 

To test the null hypothesis (3) versus the alternative (4), 
we use the statistic 

A = 2C (M Ul )-2C (Mho). 

As the model Mm is n °t obtained as a maximum like- 
lihood estimate, this statistic does not follow the "naive" 
X 2 (j — 1) distribution where j is the number of leaves of 
the tree used in hAu\ • F an et al. [12] demonstrated that for 
a variety of models involving non parametric estimators, 
such generalized likelihood ratio statistics follow a scaled 
chi-squared distribution. In our case, this implies that for a 
defined number of leaves j the distribution of A is a scaled 
chi-squared distribution: 

mA ~ X 2 (b). (6) 

As the theoretical determination of m and b is cumber- 
some, Fan et al. propose to simulate the null distribution 
for estimating the constants m and b. In the following, 
we use the conditional parametric bootstrap procedure 
described below: 

• Generate a new outcome Y b from the fitted model 
g(E(Y\X)) = X'Oo + fh 



• Fit the complete model (2) with Y b as the outcome 
(as in step 1) 

g (e (y & |x, z)) = x'e b + PjkF (r b R (Z)) 

• Repeat the previous step until the size R is greater 
than ; 

• Construct a sequence of candidate optimal subtrees 
| T b ; k = 2, . . . ,y j as in step 2 (where we take r = j) 
and compute 

A b = 2C (M^ - 2C (M^) 

• Repeat this process B times 

• Estimate b and m from the empirical moments of 
sample A 1 ,..., A 5 . 

Once b and m have been estimated, a p-vdlue is calcu- 
lated as/? = ¥(X > mA) withX - x 2 (b). 

Results 

Simulation study 
Simulation protocol 

A simulation study with a binary outcome (logit link) was 
conducted to evaluate and compare the performances of 
the proposed procedure (with the three selection criteria) 
to the original one proposed by Chen et al. 

We have considered three different scenarios for which 
we used PLTR logistic model similar to the one considered 
in Chen et al. 

• In scenario 1, we simulated four Bernoulli variables 
Gi, G 2 , G 3 , G 4 with probabilities 0.3, 0.25, 0.18 and 
0.22 respectively, and an outcome Bernoulli variable, 
denoted Y, according to the following model (null 
hypothesis): 

logit P (Y = l|Gi, G 2 , G 3 , G 4 ) = Pi + OGi 

with Pi = log(0.61), 0 = log(2). Here Gi is the 
confounding variable and G 2 , G3, G4 are the 
explanatory variables. 

• In scenario 2, we introduced ten additional Bernoulli 
variables G5, . . . , G14 with probabilities p = 0.5. The 
Bernoulli variable Y is simulated according to the 
following model: 

logitP(7 = l|Gi,...,Gi 4 ) =Pi+OG 1 +p 2 lG 2 =i t G 3 ^o 

+ ^3lG 3 =l,G 4 =0 

+ ^41g 3 =i,g 4 =i- 

with pi = log(0.45), 0 = log(2), p 2 = log(3.5), 
Pz = log(2) and ^4 = log(4.5). This scenario mimics 
joint effects of G 2 , G3, and G4. The corresponding 
tree is displayed in Figure (2). The variables 
G5, . . . , G14 are noise variables unrelated to Y. 
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Simulated tree for scenario 2 



— 1 G% 



G 4 



G 4 



G 2 



G 2 



log 2 



log 4.5 



02 — log 3.5 reference leaf 



Figure 2 Tree used for scenario 2 simulations. The leaves are represented by circles and the number beneath each node represents the real 
value of the coefficient consider in each leaf of the tree. 



• In scenario 3, we considered a deeper tree with 
non-independent explanatory variables G2, . . . , G5. 
The model is: 

logitP(r=l|Gi, . . . , G 15 ) = Pi+ £Gi+A1 G2= o,g 3 =o,g 4 =i 

+ ^31g 2 =o,g 4 =i 
+ A1g 2 =i,g 5 =o 

+ /*5lG 2 =l,G 3 =0,G 5 =l 
+ /*6lG 2 =l,G 3 =l,G 5 =l« 

with pi = log(1.8), 0 = log(L35), 0 2 = log(1.50), 
0 3 = log(2), 04 = log(0.36), 05 = log(2.5) and 
^r3 6 = log(0.36). The corresponding tree is displayed 
in Figure (3). 

The Bernoulli variables Gi, . . . , G5 were generated 
from the following hierarchical model: 

G 0 ~ 23(0.2), logitP(G; = 1) = logit(0.2) + G 0 
for / = 2, . . . , 5 



and 

logitP(Gi = 1) = logit (0.2) + log(2)G 0 . 

Hence the variables Gi, . . . , G5 are marginally 
dependent. The variables G& . . . , G15 are considered 
as noise variables and are generated independently 
from a Bernoulli distribution with p = 0.5. 

For all scenarios the sample was set to n — 2000, and 300 
datasets were simulated. 

Simulation results 

Figures 4 and 5 display the quantile-quantile plots of the 
observed statistics for the "naive" theoretical x 2 distribu- 
tion with degrees of freedom equal to the number of leaves 
minus 1, and for the scaled x 2 distribution (equation 6). 
These figures show that the naive distribution is inade- 
quate; in contrast, the scaled distribution with estimated 
m and b fits well the empirical distribution. 



Simulated tree for scenario 3 



Go — 1 Go 
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G 3 



As = log 2 




G 3 




02 = log 1.5 reference leaf 



Figure 3 Tree used for scenario 3 simulations. The leaves are represented by circles and the number beneath each node represents the real 
value of the coefficient consider in each leaf of the tree. 



Mbogning etol. Journal of Clinical Bioinformatics 2014, 4:6 
http://www.jclinbioinformatics.eom/content/4/1/6 



Page 7 of 1 1 



Trees with 2 leaves Trees with 3 leaves Trees with 4 leaves 
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theoretical quantiles (df = 1 ) theoretical quantiles (df = 2) theoretical quantiles (df = 3) 



Trees with 5 leaves Trees with 6 leaves Trees with 7 leaves 
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theoretical quantiles (df = 4) theoretical quantiles (df = 5) theoretical quantiles (df = 6) 



Trees with 8 leaves 




0 5 10 15 20 25 
theoretical quantiles (df = 7) 
Figure 4 Quantile-quantile plots of the observed statistics versus the "naYve" x 2 quantiles. 



We assess whether or not the trees selected by step 3 
in the sequence of nested trees have the correct num- 
ber of leaves. Under scenario 1 (null hypothesis), a root 
tree (one leaf) is expected. As seen in Table 1, the proce- 
dure with the BIC criterion (BIC) selects the root tree for 
98.3% of the simulations, whereas the Chen et al. proce- 
dure (named BOOT hereafter) succeeds for only 91% of 
the simulations. For the 10-fold cross validation procedure 
(CV), this proportion goes down to 84%, and for the AIC 
criterion (AIC) it is only 47.3%. 

Under scenario 2, the correct number is of 4 leaves. 
As seen from Table 2, BIC has the best performance 
with 44.3% of selected trees with four leaves. Moreover, it 
exhibits the smallest dispersion around the target value. 
In contrast BOOT selects a tree with only 2 leaves for all 



the simulations. The performances of CV are inferior to 
those from BIC, and the dispersion is higher. Finally, AIC 
selects always trees with too many leaves. Similar results 
are obtained with scenario 3 (where the correct number of 
leaves is 6), with increased quality of CV (Table 3). 

For the more complex scenario (scenario 3), we com- 
puted the ten-fold cross-validation generalization error 
for each of the 300 simulated data sets for BIC, AIC and 
CV criteria. The distribution of the generalization errors 
are displayed in Figure 6. CV and BIC have very similar 
errors, while AIC have a slightly increased error. 

In summary, the procedure using the BIC criterion 
consistently outperforms the other procedures. 

We also investigated which variables are present in the 
splits of the trees selected by BIC under scenario 2 and 3 
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Trees with 2 leaves 
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Trees with 8 leaves 




20 25 

theoretical quantiles (df = 7.14, m =1.02) 
Figure 5 Quantile-quantile plots of the observed statistics versus the scaled x 2 quantiles. 



(Tables 4 and 5). For scenario 2, the so-called correct vari- 
ables are G2 to G4, and in scenario 3, G2 to G5. In both 
scenarios, we refer as incorrect variables the ten noise 
variables. Under scenario 2, in 18% of the selected trees, 
at least one noise variable appears; however, all three cor- 
rect variables are present in 44.3% of the selected trees. In 
all cases, at least two correct variables were selected. The 
BIC procedure behaves better under scenario 3, with more 



than 99% of trees involving all four correct variables, while 
noise variables appear in 20% of the trees. 

Analysis of lung adenocarcinomas 
Description of the data 

The dataset considered in this study is based on a French- 
Singaporean study (Merlion study) of 230 patients with 
lung adenocarcinomas [13]. The Western-Europe series 



Table 1 Number of trees by number of leaves, for the 300 
trees selected by the different methods under scenario 1 



Table 2 Number of trees by number of leaves, for the 300 
trees selected by the different methods under scenario 2 



Leaves 
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BOOT 


273 
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BOOT 
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300 
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0 
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0 


cv 


252 
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18 


10 


10 


10 


CV 
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18 


83 


61 


36 


32 


21 


19 


13 


17 


BIC 


295 
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46 
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142 


64 


54 
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5 
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3 
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24 


265 
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Table 3 Number of trees by number of leaves, for the 300 
trees selected by the different methods under scenario 3 



Leaves 


1 


2 


3 


4 


5 


6 


7 


8 


9 


10 


BOOT 


0 


300 


0 


0 


0 


0 


0 


0 


0 


0 


cv 


0 


0 


41 


16 


22 


154 


22 


12 


17 


16 


BIC 


0 


0 


0 


1 


89 


162 


36 


9 


3 


0 


AIC 


0 


0 


0 


0 


0 


0 


0 


1 


2 


297 



(WE) included 139 tumors and the East- Asian series (EA) 
included 91 tumors. Clinical characteristics were detailed 
in a previous published article [13]. DNA was extracted 
using standard protocols and stored at — 80°C until use. 
Copy number information was issued from Affymetrix 
Genome- Wide Human SNP 6.0 arrays. Inferences about 
the copy number status of each genomic segment (copy 
loss, copy modal, copy gain) were obtained using the mod- 
ified CGHmix classification procedure [14]. In order to 
summarize genomic information while keeping a suffi- 
cient level of resolution, copy number status was aver- 
aged (median estimate) over the 284 main cytogenetic 
bands. Information about KRAS mutation was extracted 
from the targeted mutation profiling performed using the 
Sequenom Massarray 4 platform (Sequenom, San Diego, 
CA). Here, the KRAS mutation status was defined as the 
presence or absence of any mutation within KRAS gene. 
In this dataset, we detected 54 KRAS mutations with 44 
cases (31.6%) from the WE series and 10 cases (10.9%) 
from the EA series. 

We compared the results obtained from the Chen et al. 
procedure [8] to those obtained by the novel procedure 
with the BIC criterion. The "dependent" variable was the 
KRAS mutation status (mutation/wild- type). The cohort 
status ( WE/EA) was the confounding binary variable. The 
284 copy-number alterations (trinomial variable: copy- 
loss, modal, copy gain) were considered as candidate 
explanatory variables. Recursive partitioning stopped as 



CV 



BIC 



AIC 



Figure 6 Distribution of the generalization 10-fold 
cross-validation error for AIC, BIC CV criteria across the 300 
simulated data sets. 



Table 4 Variables selected by the procedure using BIC 
criterion under scenario 2, with global percentages 
between brackets 

Incorrect variables 
0 1 2 3 4 5 

Correct 0 0 0 0 0 0 0 

Variables 1 0 0 0 0 0 0 

2 134(44.66%) 22 (7.33%) 10(3.33%) 0 0 1(0.33%) 

3 112 (37.33%) 19(6.33%) 2(0.66%) 0 0 0 



soon as the number of cases in each terminal node was 
below fifteen. 

Results 

The iterative procedure converged after 15 iterations. The 
trees selected by Chens et al. method and by our pro- 
cedure with the BIC criterion are displayed in Figure 7. 
Chens et al. procedure led to two leaves that sepa- 
rated tumors with and without copy-loss of 3q23. The 
global adjusted p-vdlue associated with the selected tree 
is 0.0055. This model is a simple cohort-adjusted logis- 
tic regression model with 3q23 copy-loss as the unique 
explanatory variable. 

Our procedure with the BIC criterion led to seven 
leaves. We identified: 

(i) two pure or nearly pure wild-type KRAS leaves (with 
53 tumors and only one KRAS mutation) 
characterized by no 3q23 copy-loss and a copy-loss 
for either 9ql2 or 3pll cytoband, 

(ii) a leave with a low rate of KRAS-mutated tumors (8%) 
characterized by no copy-loss of 3q23, 3pll, 9ql2, 
14q23 but a copy gain of 8q23 cytoband, 

(iii) a leave with a medium rate of KRAS-mutated tumor 
(15.5%) with no copy-loss of 3q23, 3pll, 14q23 and 
no copy-gain of 8q23 and lq32 cytoband, 

(iv) the three other leaves were heterogeneous with a 
mixture of wild-type and KRAS-mutated tumors 
(43.5%, 52.6%, 51.2%). 



Table 5 Variables selected by the procedure using BIC 
criterion under scenario 3, with global percentages 
between brackets 



Incorrect variables 



Correct 0 
variables 1 
2 
3 
4 



0 

0 
0 
0 

1 (0.33%) 



1 

0 
0 
0 

1 (0.33%) 



239(79.66%) 50(16.66%) 8 (2.< 



3 4 5 

0 0 0 

0 0 0 

0 0 0 

0 0 0 

1 (0.33%) 0 0 
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(a) BIC (p = 0.011; CV err = 0.23) (b) Original PLTR (p = 0.0055; CV err = 0.23) 
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Figure 7 Optimal tree obtained with the two competing methods on the real data set: (a) BIC selected tree, (b) Original PLTR selected 
tree. The leaves are represented by circles and the number in each leave node represents the number of observations falling inside the node; the 
percentage represented proportion of cases inside the node. 



For the selected tree, the split variables are copy-number 
aberrations of lq32, 3pll, 3q23, 8q23, 9pl2, and 14q23. 
The global p-vdlue associated with this tree is 0.011 with 
a ten-fold cross-validation generalization error of 0.23. 
These results were obtained after adjustment for a signifi- 
cant cohort effect (OR = 0.266, 95% Confidence interval: 
[0.12 - 0.56]) with a higher rate of KRAS for the WE series 
as compared to the EA series. 

We also compared the characteristics of the 53 tumors 
arising from the two pure or nearly pure wild-type KRAS 
leaves as compared to the other tumors. There was no sig- 
nificant difference between the two groups regarding the 
EGFR mutation status (p = 0.94). There was a signifi- 
cantly higher proportion of tumors with a large fraction of 
genome altered (more than 50%) in the pure or nearly pure 
wild-type KRAS group as compared to the other groups 
(p = 1.7 x 10~ 8 ). 

Discussion 

Nowadays, there is a growing interest in deciphering 
the genomic spectrum of clinical disease entities. In this 
context, recursive partitioning methodology provides a 
powerful data mining tool for exploring complex interplay 
between genomic factors, with respect to a main factor, 
that can reveal hidden genomic patterns. The requirement 



of adjusting for confounding factors led Chen et al. to 
develop a semiparametric regression model called PLTR 
together with an iterative algorithm procedure to select 
and test the "optimal" tree. A main drawback of the 
procedure is that it relies on a two levels permutation 
strategy which can become cumbersome and computa- 
tionally expensive. In this work, we propose a novel pro- 
cedure with different selection criteria. As shown from 
the simulation study, the proposed procedure with the 
BIC criterion achieves good power to detect the hidden 
structure as compared to Chens et al procedure. 

When investigating patterns of copy-number alterations 
in lung adenocarcinomas, with respect to KRAS muta- 
tion status and after adjustment for a cohort effect, our 
proposed strategy highlights two subgroups of pure or 
nearly pure wild-type KRAS tumors. These subgroups 
correspond to 53 lung adenocarcinomas having no 3q23 
copy-loss but copy-loss for either 9pl2 or 3pll cytoband. 
It is worth noting that the 3q23 area harbors the PI3KCB 
gene that participates in the PI3K (Phosphatidylinositol 
3-kinase) signaling pathway, well-known to be deregulated 
in many human cancers. Moreover, PI3K is one of the 
main effector pathways of RAS, regulating cell growth, cell 
cycle and cell survival. These wild-type KRAS subgroups 
are not enriched for EGFR mutation (mutually exclusive 
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with KRAS mutation) and are composed of tumors hav- 
ing a proportion of copy-number changes significantly 
higher than expected by chance. The genomic patterns 
of these two wild-type KRAS subgroup are worth further 
investigation. 

Conclusion 

We have proposed a novel recursive partitioning proce- 
dure for deciphering the genomic spectrum of clinical 
disease entities. The proposed procedure represents a 
powerful and practical alternative to the partially linear 
tree-based regression model proposed by Chen et al. [8]. 
Our procedure performs well, is simple to implement, less 
computationally demanding and can be recommended 
for investigating new disease taxonomy. The procedure 
is implemented within an R package known under the 
acronym 'GPLTR' and will be available very soon on the 
CRAN site. 

We plan to use this novel procedure to identify new 
sub-groups of multiple sclerosis treated with interferon- 
beta, with regards to the occurrence of antidrug-antibody 
response, while adjusting for cohort effect. 
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